In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
Alright, in this section we're going to continue with the running data set, but we're going to dive a bit deeper into ways of analyzing the data, including filtering, dropping rows, doing some groupings, and that sort of thing.
So what we'll do is read in our csv file.
In [7]:
pd.read_csv?
In [8]:
list(range(1,7))
Out[8]:
In [13]:
df = pd.read_csv('../data/date_fixed_running_data_with_time.csv', parse_dates=['Date'], usecols=list(range(0,6)))
In [14]:
df.dtypes
Out[14]:
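The read_csv call above can be sketched on a tiny in-memory CSV. This is a minimal illustration, the column names here are made-up stand-ins, not the real file's:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the running-data file
# (column names are illustrative, not the actual file's).
csv_text = """Date,Miles
2014-01-01,2.0
2014-01-02,0.0
"""

# parse_dates converts the named columns to datetime64;
# usecols limits which columns get read at all.
small = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], usecols=[0, 1])
print(small.dtypes)
```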
In [15]:
df.sort_values('Date', inplace=True)
df.head()
Out[15]:
Now let's talk about getting some summary statistics. I would encourage you to try these on your own. We've learned pretty much everything we need in order to do these without guidance, and as always, if you need clarification just ask on the side.
What was the longest run in miles and minutes that I ran?
In [16]:
df.Minutes.max()
Out[16]:
In [17]:
df.Miles.max()
Out[17]:
What about the shortest in miles and minutes that I ran?
In [18]:
df.Minutes.min()
Out[18]:
In [19]:
df.Miles.min()
Out[19]:
We forgot to exclude the zero values from the days I didn't run, so how would we do it while ignoring those?
In [20]:
df.Miles[df.Miles > 0].min()
Out[20]:
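The boolean-mask trick above can be shown on a toy Miles column (the values here are made up for illustration):

```python
import pandas as pd

# A toy Miles column with zero-mile rest days mixed in.
miles = pd.Series([0.0, 3.1, 0.0, 1.5, 6.0])

# Boolean indexing keeps only the positive entries,
# so min() skips the rest days entirely.
shortest_real_run = miles[miles > 0].min()
print(shortest_real_run)  # 1.5
```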
What was the most common running distance I did, excluding times when I didn't run at all?
In [21]:
df.Miles[df.Miles > 0].value_counts().index[0]
Out[21]:
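The value_counts approach works because the result is sorted by frequency. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data: 3.0 miles appears most often among actual runs.
miles = pd.Series([0.0, 3.0, 3.0, 5.0, 0.0, 3.0])

# value_counts sorts by frequency (descending),
# so index[0] is the most common value.
most_common = miles[miles > 0].value_counts().index[0]
print(most_common)  # 3.0
```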
Plot a graph of the cumulative running distance in this dataset.
In [22]:
df.Miles.cumsum().plot()
plt.xlabel("Day Number")
plt.ylabel("Distance")
Out[22]:
Plot a graph of the cumulative running hours in this data set.
In [23]:
(df.Minutes.fillna(0).cumsum() / 60).plot()
Out[23]:
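The fillna-then-cumsum pattern is worth seeing in isolation, since a missing value would otherwise propagate NaN through the cumulative sum. A sketch with made-up minutes:

```python
import pandas as pd

# Toy minutes column with a missing day.
minutes = pd.Series([30.0, None, 45.0])

# fillna(0) treats missing days as zero minutes so cumsum stays defined,
# and dividing by 60 converts cumulative minutes to hours.
hours = minutes.fillna(0).cumsum() / 60
print(hours.tolist())  # [0.5, 0.5, 1.25]
```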
Another interesting question we could ask is: what days of the week do I commonly go for runs? Am I faster on certain days, or does my speed improve over time relative to the distance that I'm running?
So let's get our days of the week.
In [24]:
df.Date[0].strftime("%A")
Out[24]:
We will do that by mapping our Date column to the day-name format we need.
In [25]:
df.Date.map(lambda x: x.strftime("%A")).head()
Out[25]:
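The element-wise map can be sketched on a couple of known dates; note that the vectorized `.dt` accessor gives the same result:

```python
import pandas as pd

# Two dates whose weekdays we know: 2014-01-06 was a Monday.
dates = pd.Series(pd.to_datetime(['2014-01-06', '2014-01-07']))

# Element-wise map, as in the notebook...
day_names_map = dates.map(lambda x: x.strftime("%A"))

# ...or the vectorized .dt accessor, which does the same thing.
day_names_dt = dates.dt.strftime("%A")
print(day_names_map.tolist())  # ['Monday', 'Tuesday']
```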
Then we just set that to a new column.
In [26]:
df['Day_of_week'] = df.Date.map(lambda x: x.strftime("%A"))
In [27]:
df.head(10)
Out[27]:
We can make a bar plot of it and see if we can distinguish anything unique about certain days of the week.
In [28]:
df[df.Miles > 0].Day_of_week.value_counts().plot(kind='bar')
Out[28]:
We can see that in this sample I run a lot more on Friday, Saturday, and Monday. Some interesting patterns. Why don't we try looking at the means and that sort of thing?
But before we get there: at this point our data frame is getting pretty messy, and I think it's worth explaining how to remove columns and how to add and remove rows.
First let's remove the Time column, seeing as we already have Minutes and Seconds.
In [29]:
del df['Time']
del will delete it in place
In [30]:
df.head()
Out[30]:
Finally, we can use drop to drop a column. Note that we have to specify the axis (we can also use this to drop rows), and that drop does not happen in place.
In [31]:
df.drop('Seconds',axis=1)
Out[31]:
We can also use drop to remove a specific row by specifying axis 0.
In [32]:
tempdf = pd.DataFrame(np.arange(4).reshape(2,2))
tempdf
Out[32]:
In [33]:
tempdf.drop(1,axis=0)
Out[33]:
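Since drop returns a new object rather than modifying in place, the original frame survives both calls. A minimal sketch (with made-up column labels):

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['a', 'b'])

# drop returns a new DataFrame; tmp itself is unchanged.
no_row = tmp.drop(1, axis=0)    # drop the row with index label 1
no_col = tmp.drop('b', axis=1)  # drop column 'b'
print(no_row.shape, no_col.shape)  # (1, 2) (2, 1)
```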
We already saw how to create a new column; we can also create a new row using the append method. This takes a data frame or Series and appends it to the end of the data frame.
In [34]:
tempdf.append(pd.Series([4,5]), ignore_index=True)
Out[34]:
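One caveat: append was removed in pandas 2.0, so on newer versions the equivalent is pd.concat. A sketch of the same row-append with concat (toy values):

```python
import pandas as pd

tmp = pd.DataFrame({'a': [0, 2], 'b': [1, 3]})
new_row = pd.DataFrame({'a': [4], 'b': [5]})

# pd.concat is the modern replacement for append;
# ignore_index renumbers the resulting index 0..n-1.
combined = pd.concat([tmp, new_row], ignore_index=True)
print(combined)
```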
In [35]:
df.head()
Out[35]:
We can also pop out a column, which removes it from the data frame and returns the Series. You'll see that it happens in place.
In [36]:
df.pop('Seconds')
Out[36]:
In [37]:
df.head()
Out[37]:
Now we've made our dataset a bit more manageable, and we've got the basics of what we need to perform some groupwise analysis.
Now at this point we're going to do some groupings. This is an extremely powerful part of pandas and one that you'll use all the time.
pandas follows the Split-Apply-Combine style of data analysis.
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
Hadley Wickham from Rice University: http://www.jstatsoft.org/v40/i01/paper
Since we're going to want to check things in groups, what I'm going to do is analyze each day of the week to see if there are any differences in the types of running that I do on those days.
We'll start by grouping our data set on those weekdays, basically creating a dictionary-like mapping where the key is the weekday and the value is the data frame of the rows for that day.
First let's do this the hard way....
In [39]:
for dow in df.Day_of_week.unique():
    print(dow)
    print(df[df.Day_of_week == dow])
    break
This is clearly an ugly way to do it, and pandas provides a much simpler approach: creating a groupby object.
But first I'm going to filter out our zero values because they'll throw off our analysis.
In [40]:
df['Miles'] = df.Miles[df.Miles > 0]
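That assignment is subtler than it looks: the filtered Series is aligned back on the index, so the zero-mile rows become NaN rather than being dropped. A sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'Miles': [0.0, 3.0, 0.0, 5.0]})

# Assigning a filtered Series back aligns on the index, so rows that
# fail the condition become NaN instead of disappearing.
toy['Miles'] = toy.Miles[toy.Miles > 0]
print(toy.Miles.tolist())  # [nan, 3.0, nan, 5.0]
```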
In [42]:
dows = df.groupby('Day_of_week')
print(dows)
We can get the size of each group by using the size method. This tells us how many items are in each category.
In [43]:
dows.size()
Out[43]:
In [44]:
dows.count()
Out[44]:
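The difference between size and count is easy to miss: size counts rows per group, while count counts non-null values per column. A sketch with toy data containing one missing value:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Day_of_week': ['Monday', 'Monday', 'Friday'],
    'Miles': [3.0, None, 5.0],
})
g = df_toy.groupby('Day_of_week')

# size() counts rows per group, including rows with NaN;
# count() counts only non-null values per column per group.
print(g.size())
print(g.count())
```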
Now we have our groups and we can start doing groupwise analysis. What does that mean?
It means we can start answering questions like: what is the average speed per weekday, or what is the total miles run per weekday?
In [45]:
dows.mean()
Out[45]:
In [46]:
dows.sum()
Out[46]:
It might be interesting to look at the totals to try to spot outliers, since Thursday, Friday, and Saturday are relatively close in distance but not so much in speed.
We also get access to a lot of summary statistics from here that we can get from the groups.
In [47]:
dows.describe()
Out[47]:
In [48]:
df.groupby('Day_of_week').mean()
Out[48]:
In [49]:
df.groupby('Day_of_week').std()
Out[49]:
Iterating through the groups is also very straightforward.
In [51]:
for name, group in dows:
    print(name)
    print(group)
You can get specific groups by using the get_group method.
In [52]:
dows.get_group('Friday')
Out[52]:
We can use the agg method to apply an operation to each group, for example getting the count for each group.
In [53]:
dows.agg(lambda x: len(x))['Miles']
Out[53]:
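The agg call above can be sketched on a toy frame; the lambda runs once per column per group, and len on a group's column gives its row count:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Day_of_week': ['Monday', 'Friday', 'Monday'],
    'Miles': [3.0, 5.0, 2.0],
})
g = df_toy.groupby('Day_of_week')

# agg applies the function to each group's column;
# len(x) gives the number of rows in that group.
counts = g.agg(lambda x: len(x))['Miles']
print(counts)
```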
Another way to do this would be to add a count column to our data frame and then sum each group.
In [54]:
df['Count'] = 1
df.head(10)
Out[54]:
In [55]:
df.groupby('Day_of_week').sum()
Out[55]:
That about covers everything I wanted to cover regarding pandas basics for grouping, filtering, and that sort of thing. This is really the foundation and likely a lot of what you'll be using on a regular basis. In the next section we will dive a bit deeper into more intensive analysis and more detailed function usage.
In [ ]: